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Abstract. Three algorithms of Gram-Schmidt type are given that produce 
an orthogonal decomposition of finite d-dimensional symmetric, alternating, or 
Hermitian forms over division rings. The first uses d 3 /3+0(d?) ring operations 
with very simple implementation. Next, that algorithm is adapted in two new 
directions. One is an optimal sequential algorithm whose complexity matches 
the complexity of matrix multiplication. The other is a parallel NC algorithm 
with similar complexity. 



1. Introduction 

The classic Gram-Schmidt 'orthogonalization process' returns an orthonormal 
basis of an inner product space. Here we generalize that process in the appropriate 
fashion to Hermitian forms over division rings A. For us a Hermitian A-form is 
a function b : V x V — > A on a finite-dimensional A-vector space V where b is 
linear in the hrst variable and for some anti-isomorphism a of A, for all u,v G V, 
b(u,v) = b(v,u) a . This captures the usual symmetric and skew-symmetric forms 
as well as the traditional Hermitian forms; cf [10]. We identify V with a space 
of row vectors and so describe b by a matrix B such that b(u,v) = uBv (7t . The 
assumptions on b force B — 0, or B — sB at with s = ±1 and a 2 — 1. To change the 
basis we use an invertible matrix A and observe b(uA, vA) — uABA crt v at . Hence, a 
fully refined orthogonal decomposition for b is captured by a matrix A under which 

ABA 171 is nearly diagonal, nearly in that sometimes J .- 



-1 



is required. 



Theorem 1. Let A be a division ring, let s = ±1. and let a be a unital anti- 
isomorphism of A with a 2 = 1 . There are deterministic algorithms that, given a 
(d x d) -matrix B = sB at , return an invertible (d x d) -matrix A such that 

(1.1) ABA at = B 1 © • ■ • © B m 

and each Bi is either lxl or J . 

(i) The first algorithm uses e inversions, {^j equality tests, e 3 /3 + 0(d 2 ) additions 
and e 3 /3 + 0(d 2 ) multiplications in A, where e = d — r and r is the rank of 
the null space of B. 

(ii) The second algorithm returns a straight line program to A using 0{d u} ) oper- 
ations in A, where u is the exponent of matrix multiplication. 
(Hi) The third algorithm is parallel NC 3 in an arithmetic model, that is, it uses 
0(log d) operations in A on d°^ processors. 
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Theorem [T] in part proves that Hermitian forms over division rings have a decom- 
position of the type described in For fields this is well-known, e.g. [U Theo- 
rems 3.7], but most proofs begin by classifying forms into subclasses, e.g. symmetric 
if s = 1 and er = 1, skew-symmetric if s = — 1 and er = 1, or Hermitian if a ^ 1. 
Nice bases are constructed by individual arguments for each case. Here we find 
a single argument allows for uniform optimal asymptotic and parallel algorithms 
without dependence on A. 

The idea behind Theorem [TJi) is shared by many generalizations of Gram- 
Schmidt. For symmetric forms it goes back at least to Smiley's Algebra of Matrices 
[H Section 12.2] and is adequately described as symmetric Gaussian elimination. 
Dax and Kaniel [3] give a detailed analysis of such an algorithm for symmetric 
forms. Holt and Roney-Dougal [4] use the method in a case- by-case algorithm for 
Hermitian forms over finite fields. I was also gratefully alerted to a predecessor to 
Theorem HJii) that applies to symmetric forms over fields; see Theorem 16.25]. 

The algorithms for Theorem Q] parts (ii) and (iii) settle the complexity of finding 
an nice basis for a generic Hermitian form, but these may not be best suited for 
certain applications. First, they depend on data structures for fast matrix mul- 
tiplication which may provide an undesirable overhead in small dimensions. The 
exact cross-over dimension is an issue of ongoing research; see [111 p. 313]. Further- 
more, our algorithms assume exact field operations, such as in algebraic number 
fields, rational Quaternion division rings, or finite fields. We make no claims about 
their numerical stability in fields with floating point approximations. In such cases 
consider [6]. 

Each of our algorithms allows the user to choose a computational encoding for 
A, such as by polynomials or matrices over a field. If no alternative suggests itself, 
an adequate method is to encode A by structure constants over its center; see [3 p. 
223]. Also, b can be encoded as a "black-box"; however, we will eventually evaluate 
b on all unordered pairs from a fixed basis for V and so it simplifies our description 
to assume that b is input by a (d x d)-matrix B — sB at with s = ±1 and a 2 = 1. If 
the s and a are not specified with B, then suitable values can be detected during 
the execution of the algorithms for Theorem Q] We either prove that B = or 
we find u,v € V such that b(u,v) — uBv at ^ 0. The first such pair u, v € V 
determines a by a : a i-> b(u, av)b(u, w)^ 1 , and s — b{v,u)~ l ■ b(u,v) a . We write 
the algorithm as though s and a are known. 



2. Smiley's method 

Let us start with the algorithm Decompose which is not asymptotically optimal, 
but which (I believe) is the simplest to implement and captures all Hermitian forms 
at once. This is the prototype for the optimal sequential and parallel algorithms 
given later. 

If A is an elementary matrix then ABA at modifies B in one of three ways. First, 
if A is a diagonal matrix with l's on the diagonal except for A in entry i, then 
ABA at scales row i by A, and column i by A CT . For instance: 
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We describe that as scaling row-column i by X. Second, if A is a transposition of 
i and j then ABA at has the entries from B with rows i and j swapped as well as 
columns i and j swapped. We call this swapping row-column i with row-column j. 
That does not involve operations in A. Thirdly, if A = I + A-Ey then ABA at has 
the effect of adding A times row i to row j and A CT times column i to column j, as 
illustrated below. 
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That implicitly involved the fact that entries j3 on the diagonal satisfy j3 = s(3 a . 
To clear a row-column means to use a selected non-zero entry j in a row-column i 
of B, and use successive multiplications by I + XkEki, for k E {1, . . . , d} — {i} to 
set all other entries in the row-column i to zero. This is possible whenever i = j 
or Ba — 0. Using the symmetry of the matrices B — sB at , clearing a row-column 
uses d 2 + 0(d) additions, d 2 + 0(d) multiplications, d applications of a, and one 
inversion. 

We use upper case Roman letters for block sub-matrices and lower case Greek 
letters for coefficients in A. We also assume that the associated matrix A which 
transforms B into the return ABA 17 * as in is evident form the operations 

described, and so we do not explicitly include A in the description of the algorithm. 



Standardize I B = 

If a ± 0, set A = 



1 

s a 
—aT 
\ 



and s = 1 ^ — 1, set A = 
return B. 



and return ABA* 1 = [-sa^ 1 ] [a]. If a = 
1 1 



-1 



and return ABA at = [2] © [-2]. Else 



Decompose( B E M d (A) : B = sB at ): 
(I) [Base case] If d < 1 return B. 

(II) [Anisotropic case] If B\\ = f3 ^ 0, then use that entry to clear the 

"/3 (T 



remaining non-zero entries of row-column 1. Now B 



B' 



Return [/3] © Decompose^'). 
(Ill) [Isotropic case] Else, if B\2 =7^0 (after a possible swap of a 

7 * 
S7 <T a * 



row-column), i.e. B 



then scale row-column 2 by 7 1 and 



2. Now B 



excluding B22, use B12 to clear row-column 1 and B21 to clear row-column 

~B' 0] [0 1" 

n „„ where £? = 
is s a 

Return Standardize^') © Decompose(_B"). 

"0 0' 



(IV) [Radical case] Else, B = 



B' 



so return [0] © Decompose^'). 
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Proof of Theorem\])(i) . The algorithm Decompose returns a block diagonal matrix 
whose blocks are as in (|1.1J . That algorithm only modifies the entries of B so that 
the space complexity is 0{d 2 ) elements in A. 

Now we consider the time complexity. There are at most d equality tests to 
decide on the correct case to enter. The anisotropic case clears one row-column 
and recurses on a matrix of dimension d — 1. The isotropic case clears two row- 
columns, performs some multiplications of (2 x 2)-matrices, and recurses on a matrix 
of dimension d— 2. Finally, the radical case simply recurses on a matrix of dimension 
d — 1. Hence, if T(d) is the number of additions performed by the algorithm, then 
T(d) e 2d 2 + T(d - 2) + 0(d). If r is the dimension of the radical and e = d — r, 
then T(d) 6 e 3 /3 + 0(d 2 ). The algorithm uses the same number of multiplications, 
(2) equality tests, e inversions, and (2) — (2) applications of a. □ 



3. Optimal and parallel methods 

Multiplication of (d x c?)-matrices by the traditional algorithm is not the most 
efficient method for large dimensions. The various new methods use 0(d u) ) oper- 
ations in A for some 2 < oj < 3 11, p. 315]. Here we prove the same complexity 
for finding a decomposition as in (|1.1[) . Biirgisser et. al. give an example of a sym- 
metric (d x d)-matrix over a field where the complexity of finding an orthogonal 
basis is 0(d u ) (provided that cj > 2) [2, Theorem 16.20] and so the complexity in 
Theorem QJii) is best possible in general. 



Proof of Theorem\]^ii) . The algorithm DecomposeByBlocks suffices as described 
so it remains to analyze the time complexity of the algorithm. 

We start by detecting the radical of B. This amounts to solving for a basis of 
the null space of B. That has complexity of 0(d u ) [§J Theorem 2]. To create B" 
requires 2 matrix multiplications and d 2 applications of a. Thus the radical case 
uses 0(d u ) operations in A. The algorithm never re-enters this case. 

In the block anisotropic case we solve for a null space on a \d/2] -square matrix 
B', and multiply two (d x <i)-matrices, in (|3.1|) . Let / be the rank of the null space 
of B' . At this point we have two cases. If / = \d/2] we exit the block anisotropic 
case and enter the block semi-hyperbolic case; otherwise, we to create Y (we invert 
and multiply a (([d/2] — /) x ([d/2] — /))-matrix), multiply two (d x (i)-matrices 
in (|3.2p . We then make one recurse call to the block anisotropic case for B" , and 

X~ 

one call to the block semi-hyperbolic case for ^ at ^ where X has / rows and 

Z is (d — \d/2~\) x (d — \d/2~\). Ignoring the recursions, the anisotropic case uses 
0(d u ') operations in A. 

The block semi-hyperbolic case takes in a (d x d)-matrix partitioned by into 
(/, d — /)-blocks. We compute a null column space of an (/ X (d — /))-matrix, 
multiply 2 (d x <i)-matrices ()3.3|) ( f)3.4[) requires no computation), and we also 
multiply two (2/ x 2/)-matrices in (|3 . 5[) . Finally there are at most d/2 applications 
of Standardize and a recursive call on a ((d — 2f) x (d — 2/))-matrix B' . All this 
amounts to 0(d UJ ) operations in A before the recursion. 

Now we estimate the total cost. Let T a (d) be the cost of the block anisotropic case 
for an input of dimension d, and Th(d, f) the cost of the block semi- hyperbolic case 
for an input of dimension d where X has /-rows. For some constants C a , C'h > 0, 
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DecomposeByBlocks( B e Aid (A) : B = sB at ) : 

(I) [Detect Radical] Compute an invertible A such that AB 

B' has full row rank. Now ABA at 



B" 




where 

with B" nonsingular. Apply 



step (HH to B" . 

(II) [Block Anisotropic case] Here B is a nonsingular (d x <i)-matrix. If 

\B' *" 



d < 1, halt; else take B = 



with B' e Mp d / 2 ] (A). Find A such that 



AB'A CTt = 



B" 0' 




and B" has full rank (as in (I)). Compute 
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If B" has dimension then (as B is nonsingular) W is nonsingular; proceed 
to step HU). Otherwise, 5" = s(B") at is nonsingular. Set 
y = -sC^iB")- 1 and compute 
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Note X has full row rank since B is nonsingular. Apply step |TT]) to -B", 

Al 



and apply step (JTTTJ) to 



sA c 



then halt. 



and X has full rc 



rank. 



X 
sX at * 

Compute an invertible matrix A such that XA — [C 0] where C has full 
column rank; thus, C is invertible. Compute 



(III) [Block Isotropic case] Now B = 



(3.3) 



"c- 1 " 




X 




'c- CTt 0" 


A at 




sX at * 




A 



Observe that: 
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and decompose Z = sZ° 
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. So 7) is 


diagonal and 



upper triangular with entries on the diagonal. So D is diagonal and 
D = sD a . Reset B" to be 



(3.5) 



" 7 0" 




"0 


I' 




I 


_u at ~ 




'0 7" 


-U I 




si 


Z 







I 




si D 



Sort the row-columns so that the matrix is in the form 

• ■ • © with a.i G A. Apply Standardize to each of those 

S OL\ S Ctf 

blocks. Finally, B' is nonsingular so apply step ((TTJ) to B' , then halt. 
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T h {d, f) <T a {d- 2f) + C h d", and 

T (d) < max{ T h (d, d/2), T a {d/2 - f) + T h (d/2 + f, /)) } + C a d u 

< 2T a (d/2 - /) + C h (d/2 + fY + (C a + C h )d" 

<2T a {d/2) + {C a +2C h )d u . 
Thus, T(d) G O(cF). 



□ 



Proof of Theorem\^iii) . DecomposeByBlocks uses 0(log d) recursive calls and each 
step can use the parallel NC 2 (i.e. 0(log d)) linear algebra algorithms of [SJ 
Sections 3.8, 4.5] and [9j Section 2.3] to find null-spaces and multiply matrices in 
an arithmetic model. (Those methods make it possible to trade on time efficiency to 
reduce the number of required processors, which is of importance in practice.) □ 

4. Post-processing adjustments 

In our algorithms we opted for a decomposition of B which is as close to diagonal 
as possible so that the associated basis is nearly orthogonal. It is also common to 

want a decomposition with as many blocks of the form J = 







as possible. 



The algorithm can be tuned in that direction by modifying Standardize and by 
converting various (1 x l)-blocks into J's at the end of the algorithm. The details 
are analogous to those used in Standardize. 

In some cases a canonical return is possible with a few adjustments. For example, 
the block [a] can be adjusted to [70:7"'] for / 7 £ A. Hence, if a = 7~ 1 7~ <T for 
some 7 G A — {0}, then we may replace a with 1. Computationally finding 7 to 
perform this adjustment can be involved. Already when a = 1 and A a field this 
amounts to finding a square-root of a. If the number of classes in A of the form 
70:7°' is linearly ordered, it is possible to sort the (1 x l)-blocks accordingly. 

Another situation for modification tries to convert multiple (1 x l)-blocks. For 
instance, if A is a field and 0^a = 77°" + 85 a for some 7, 6 G A (for example, if 
<T = 1 and a is a sum of squares) , then 
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Similarly, if the characteristic is 2 and a^O then 
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